AITopics | trec 2024

Auto-ARGUE: LLM-Based Report Generation Evaluation

Walden, William, Mason, Marc, Weller, Orion, Dietz, Laura, Conroy, John, Molino, Neil, Recknor, Hannah, Li, Bryan, Liu, Gabrielle Kaili-May, Hou, Yu, Lawrie, Dawn, Mayfield, James, Yang, Eugene

arXiv.org Artificial Intelligence, Oct-20-2025

Generation of long-form, citation-backed reports is a primary use case for retrieval augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation.

large language model, machine learning, natural language, (19 more...)

2509.26184

Country:

North America > United States (0.29)
North America > Mexico > Mexico City (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

Ondov, Brian, Xia, William, Attal, Kush, Unde, Ishita, He, Jerry, Demner-Fushman, Dina

arXiv.org Artificial Intelligence, Jul-23-2025

Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level, rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 were provided extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.

large language model, machine learning, natural language, (22 more...)

2507.14096

Country:

North America > United States (1.00)
Europe (0.93)

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Consumer Health (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.46)
Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (0.46)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Thakur, Nandan, Pradeep, Ronak, Upadhyay, Shivani, Campos, Daniel, Craswell, Nick, Lin, Jimmy

arXiv.org Artificial Intelligence, Apr-22-2025

Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
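As a rough illustration of the exact-match agreement figure described in this abstract, the following Python sketch computes the fraction of items on which a human judge and an LLM judge assign the same label. The label names (`full`, `partial`, `none`), the function names, and the toy data are assumptions made for illustration; they are not taken from the paper or from the track's tooling.

```python
from collections import Counter

# Hypothetical three-level support scale; the label names are assumptions.
LEVELS = ("full", "partial", "none")


def exact_agreement(human, llm):
    """Return the fraction of items where both judges gave the same label."""
    if len(human) != len(llm):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("need at least one judged item")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def confusion_counts(human, llm):
    """Count (human_label, llm_label) pairs to show where the judges differ."""
    return Counter(zip(human, llm))


# Toy data, invented for this example only.
human_labels = ["full", "partial", "none", "full", "none"]
llm_labels = ["full", "full", "none", "full", "partial"]

rate = exact_agreement(human_labels, llm_labels)  # 3 of 5 labels match
disagreements = {
    pair: n
    for pair, n in confusion_counts(human_labels, llm_labels).items()
    if pair[0] != pair[1]
}
print(f"exact agreement: {rate:.0%}")
print("disagreements:", disagreements)
```

Exact-match rate treats every disagreement alike; on an ordered scale, a chance-corrected or weighted statistic may be more informative, which is a design choice this sketch leaves open.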

large language model, machine learning, natural language, (20 more...)

2504.15205

Country: North America > United States > Maryland (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Music (0.30)
Leisure & Entertainment (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text

Kocbek, Primoz, Kopitar, Leon, Zhang, Zhihong, Aydin, Emirhan, Topaz, Maxim, Stiglic, Gregor

arXiv.org Artificial Intelligence, Feb-19-2025

This paper describes our submissions to the TREC 2024 PLABA track with the aim to simplify biomedical abstracts for a K8-level audience (13- to 14-year-old students). We tested three approaches using OpenAI's gpt-4o and gpt-4o-mini models: baseline prompt engineering, a two-AI agent approach, and fine-tuning. Adaptations were evaluated using qualitative metrics (5-point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch-Kincaid grade level, SMOG Index). Results indicated that the two-agent approach and baseline prompt engineering with gpt-4o-mini models show superior qualitative performance, while fine-tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt-4o-mini outperforms iterative improvement strategies via the two-agent approach as well as fine-tuning with gpt-4o. We intend to expand our investigation of the results and explore advanced evaluations.
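The two readability scores named in this abstract have standard published formulas, sketched below in Python. This is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, and the word, sentence, and syllable counts are taken as inputs because syllable counting is itself heuristic and is deliberately left out here.

```python
import math


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (standard formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_index(polysyllables, sentences):
    """SMOG index; polysyllables = number of words with 3+ syllables."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Toy counts, invented for illustration only.
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
smog = smog_index(polysyllables=30, sentences=30)
print(f"Flesch-Kincaid grade: {fk:.2f}")
print(f"SMOG index: {smog:.2f}")
```

Both scores estimate a school grade level, so lower values indicate simpler text; neither measures accuracy or completeness, which is why the abstract pairs them with Likert-scale judgments.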

adaptation, gpt, grade level, (14 more...)

2502.14144

Country:

Europe > Slovenia > Drava > Municipality of Maribor > Maribor (0.05)
Asia > Middle East > Republic of Türkiye > Manisa Province > Manisa (0.04)
Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)

Genre: Research Report > Experimental Study (0.68)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)

A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look

Upadhyay, Shivani, Pradeep, Ronak, Thakur, Nandan, Campos, Daniel, Craswell, Nick, Soboroff, Ian, Dang, Hoa Trang, Lin, Jimmy

arXiv.org Artificial Intelligence, Nov-12-2024

The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
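To make the idea of comparing system rankings concrete, the sketch below scores runs with nDCG@k and then measures how similarly two sets of judgments order those runs. It rests on several assumptions: the abstract does not name the correlation coefficient, so Kendall's tau (the tau-a variant) is used here as one common choice; nDCG uses linear gain with a log2 discount, and the ideal ranking is built from the same gain list for brevity. All function names and scores are hypothetical.

```python
import math
from itertools import combinations


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """nDCG@k; simplification: the ideal ranking reorders the same list."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings."""
    systems = sorted(scores_a)
    if set(systems) != set(scores_b) or len(systems) < 2:
        raise ValueError("need the same systems (at least two) in both maps")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        prod = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented per-run scores under two hypothetical judgment sets.
manual = {"run1": 0.50, "run2": 0.40, "run3": 0.30, "run4": 0.20}
automatic = {"run1": 0.55, "run2": 0.35, "run3": 0.40, "run4": 0.25}

tau = kendall_tau(manual, automatic)  # 5 concordant pairs, 1 discordant
print(f"Kendall's tau: {tau:.3f}")
```

A tau near 1 means the two judgment sets rank the runs almost identically, even if the absolute scores differ.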

large language model, machine learning, natural language, (20 more...)

2411.08275

Country:

South America > Brazil > Bahia > Salvador (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Pradeep, Ronak, Thakur, Nandan, Sharifymoghaddam, Sahel, Zhang, Eric, Nguyen, Ryan, Campos, Daniel, Craswell, Nick, Lin, Jimmy

arXiv.org Artificial Intelligence, Jun-24-2024

Did you try out the new Bing Search? Or maybe you fiddled around with Google AI Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality: we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnarök, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnarök framework and baselines to achieve a unified standard for future RAG systems.

arxiv, document collection, trec 2024, (11 more...)

2406.16828

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.14)
North America > United States > District of Columbia > Washington (0.05)
Europe > Spain > Galicia > Madrid (0.04)
(15 more...)

Genre: Research Report (0.50)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
